Sparse group factor analysis for biclustering of multiple data sources
نویسندگان
چکیده
MOTIVATION Modelling methods that find structure in data are necessary with the current large volumes of genomic data, and there have been various efforts to find subsets of genes exhibiting consistent patterns over subsets of treatments. These biclustering techniques have focused on one data source, often gene expression data. We present a Bayesian approach for joint biclustering of multiple data sources, extending a recent method Group Factor Analysis to have a biclustering interpretation with additional sparsity assumptions. The resulting method enables data-driven detection of linear structure present in parts of the data sources. RESULTS Our simulation studies show that the proposed method reliably infers biclusters from heterogeneous data sources. We tested the method on data from the NCI-DREAM drug sensitivity prediction challenge, resulting in an excellent prediction accuracy. Moreover, the predictions are based on several biclusters which provide insight into the data sources, in this case on gene expression, DNA methylation, protein abundance, exome sequence, functional connectivity fingerprints and drug sensitivity. AVAILABILITY AND IMPLEMENTATION http://research.cs.aalto.fi/pml/software/GFAsparse/ CONTACTS : [email protected] or [email protected].
منابع مشابه
GFA: Exploratory Analysis of Multiple Data Sources with Group Factor Analysis
The R package GFA provides a full pipeline for factor analysis of multiple data sources that are represented as matrices with co-occurring samples. It allows learning dependencies between subsets of the data sources, decomposed into latent factors. The package also implements sparse priors for the factorization, providing interpretable biclusters of the multi-source data.
متن کاملMinimax Localization of Structural Information in Large Noisy Matrices
We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc is commonly referred to as biclustering or co-clustering. Despite its great practical relev...
متن کاملBiclustering Sparse Binary Genomic Data
Genomic datasets often consist of large, binary, sparse data matrices. In such a dataset, one is often interested in finding contiguous blocks that (mostly) contain ones. This is a biclustering problem, and while many algorithms have been proposed to deal with gene expression data, only two algorithms have been proposed that specifically deal with binary matrices. None of the gene expression bi...
متن کاملRectified factor networks for biclustering of omics data
Motivation Biclustering has become a major tool for analyzing large datasets given as matrix of samples times features and has been successfully applied in life sciences and e-commerce for drug design and recommender systems, respectively. actor nalysis for cluster cquisition (FABIA), one of the most successful biclustering methods, is a generative model that represents each bicluster by two sp...
متن کاملBiclustering via sparse singular value decomposition.
Sparse singular value decomposition (SSVD) is proposed as a new exploratory analysis tool for biclustering or identifying interpretable row-column associations within high-dimensional data matrices. SSVD seeks a low-rank, checkerboard structured matrix approximation to data matrices. The desired checkerboard structure is achieved by forcing both the left- and right-singular vectors to be sparse...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Bioinformatics
دوره 32 16 شماره
صفحات -
تاریخ انتشار 2016